Fixes fragment loading to properly pack 16b values into a 32b register by diptorupd · Pull Request #2 · ROCm/flashinfer

diptorupd · 2025-09-29T18:35:21Z

The PR fixes a bug in the HIP MMA load_fragment function. The existing implementation was incorrectly packing two 16b half values into a 32b register in a Big endian format. Since CDNA3 is a little endian ISA, it was causing incorrect values to be populated in the registers.

__half a[4];
uint32_t r1, r2;
// Existing logic: r1 = a[0],a[1]
// Fixed logic:    r1 = a[1],a[0]

The PR:

fixes the load_fragment function and simplifies the logic to not require bitwise shifts.
fixes the transpose_4x4_half_registers function that was implemented to work based on the incorrect logic.
Removed the load_fragment_transpose function as it is not required.
Updates the test_mfma_fp32_16x16x16fp16.cpp to use the transpose_4x4_half_registers function.

PS: The test_mfma_fp32_16x16x16fp16.cpp also implements the same compute as needed in compute_sfm_v and correctly implements the traversal logic for the B-matrix.

Copilot

Pull Request Overview

This PR fixes a critical endianness bug in the HIP MMA fragment loading functions. The implementation was incorrectly packing 16-bit half values into 32-bit registers in big-endian format instead of the little-endian format required by CDNA3 ISA.

Fixed load_fragment function to use direct memory access instead of bit shifting
Corrected transpose_4x4_half_registers function to work with the fixed endianness
Removed obsolete load_fragment_transpose function and replaced its usage with proper transposition

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
libflashinfer/include/gpu_iface/backend/hip/mma_hip.h	Core fix for endianness bug in fragment loading and transpose operations
libflashinfer/include/gpu_iface/mma_ops.hpp	Removed obsolete transpose function and fixed template parameter
libflashinfer/tests/hip/test_mfma_fp32_16x16x16fp16.cpp	Updated test to use corrected functions and improved indexing logic

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Copilot · 2025-09-29T18:36:23Z

libflashinfer/include/gpu_iface/mma_ops.hpp

 template <typename T>
 __device__ __forceinline__ void load_fragment_transpose_4x4_half_registers(uint32_t* R,
                                                                           const T* smem_ptr) {
-  static_assert(std::is_same<T, int>::value,
+  static_assert(std::is_same<T, __half>::value,
                "Only __half is supported for the 4x4 register transpose");


The static_assert checks for type T but the function call always uses __half. This creates a mismatch - either the template parameter T should be used in the function call, or the static_assert should be removed since T is not used.

Copilot · 2025-09-29T18:36:23Z

libflashinfer/include/gpu_iface/backend/hip/mma_hip.h

 __device__ __forceinline__ void load_fragment_4x4_half_registers(uint32_t* R, const T* smem_ptr) {
-  static_assert(std::is_same_v<T, __half>(), "Only half type is supported");
-  // Each thread loads 4 __half values in two 32b registers.
+  static_assert(std::is_same_v<T, __half>, "Only half type is supported");


The static_assert uses std::is_same_v which requires C++17, but the codebase appears to use C++14/C++11 style elsewhere (e.g., std::is_same<T, __half>::value in the other file). This should be consistent with the project's C++ standard.

Suggested change

static_assert(std::is_same_v<T, __half>, "Only half type is supported");

static_assert(std::is_same<T, __half>::value, "Only half type is supported");

rtmadduri

All these changes look good to me.

This PR introduces a patch to includes Tested with Unit Tests: Test project /root/amd_eng/flashinfer/libflashinfer/tests/hip/build Start 1: MathTest 1/4 Test #1: MathTest ......................... Passed 3.25 sec Start 2: PosEncTest 2/4 Test #2: PosEncTest ....................... Passed 3.25 sec Start 3: CascadeTest 3/4 Test #3: CascadeTest ...................... Passed 3.24 sec Start 4: PageTest 4/4 Test #4: PageTest ......................... Passed 161.15 sec 100% tests passed, 0 tests failed out of 4

This PR fixes some of the unit test failures that occur in Single Decode. It also disables clang formatting of headers. The clang format of headers causes compilation issues. The compiler is unable to find `HIP WARP SYNC INTRINSICS` causing failures. Disabling clang format fixes these issues ``` Start 1: MathTest 1/6 Test #1: MathTest ......................... Passed 3.31 sec Start 2: PosEncTest 2/6 Test #2: PosEncTest ....................... Passed 3.36 sec Start 3: CascadeTest 3/6 Test #3: CascadeTest ...................... Passed 3.35 sec Start 4: PageTest 4/6 Test #4: PageTest ......................... Passed 114.08 sec Start 5: SingleDecodeTest 5/6 Test #5: SingleDecodeTest ................. Passed 35.22 sec Start 6: BatchDecodeTest 6/6 Test #6: BatchDecodeTest .................. Passed 559.75 sec 100% tests passed, 0 tests failed out of 6 Total Test time (real) = 719.07 sec ```

CPP test suite was using `hipified` headers. In this PR, we port over unit tests to use `gpu_iface`. This is necessary for us as the next step is to move the build infrastructure to use `gpu_iface` This PR has been tested locally ``` Test project /root/flashinfer/libflashinfer/tests/hip/build Start 1: MathTest 1/6 Test #1: MathTest ......................... Passed 3.40 sec Start 2: PosEncTest 2/6 Test #2: PosEncTest ....................... Passed 3.40 sec Start 3: CascadeTest 3/6 Test #3: CascadeTest ...................... Passed 985.27 sec Start 4: PageTest 4/6 Test #4: PageTest ......................... Passed 112.40 sec Start 5: SingleDecodeTest 5/6 Test #5: SingleDecodeTest ................. Passed 35.46 sec Start 6: BatchDecodeTest 6/6 Test #6: BatchDecodeTest .................. Passed 556.81 sec 100% tests passed, 0 tests failed out of 6 ``` To replicate the tests ``` cd flashinfer/libflashinfer/tests/hip ``` ``` mkdir build && cd build/ ``` ``` cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ .. ``` ``` make ``` ``` ctest ```

In this PR I remove the `libtorch` dependency and removed `test_page.cpp`. `test_page.cpp` is the only unit test that uses libtorch. However, we also have a pytest for testing page. We will use that for validation. Removing the libtorch dependency will help us speed docker builds and remove additional dependencies. ```Test project /root/flashinfer/libflashinfer/tests/hip/build Start 1: MathTest 1/8 Test #1: MathTest ............................ Passed 0.31 sec Start 2: PosEncTest 2/8 Test #2: PosEncTest .......................... Passed 0.31 sec Start 3: CascadeTest 3/8 Test #3: CascadeTest ......................... Passed 1369.12 sec Start 4: SingleDecodeTest 4/8 Test #4: SingleDecodeTest .................... Passed 7726.35 sec Start 5: BatchDecodeTest 5/8 Test #5: BatchDecodeTest ..................... Passed 811.61 sec Start 6: test_mfma_fp32_16x16x16fp16 6/8 Test #6: test_mfma_fp32_16x16x16fp16 ......... Passed 0.30 sec Start 7: test_transpose_4x4_half_registers 7/8 Test #7: test_transpose_4x4_half_registers ... Passed 0.28 sec Start 8: test_rowsum 8/8 Test #8: test_rowsum ......................... Passed 0.27 sec 100% tests passed, 0 tests failed out of 8 ```

#2) * Fix load_fragment * Fix Array OOO access in debug function * Fixes based on Copilot review. * More Copilot review based fixes

This PR introduces a patch to includes Tested with Unit Tests: Test project /root/amd_eng/flashinfer/libflashinfer/tests/hip/build Start 1: MathTest 1/4 Test ROCm#1: MathTest ......................... Passed 3.25 sec Start 2: PosEncTest 2/4 Test ROCm#2: PosEncTest ....................... Passed 3.25 sec Start 3: CascadeTest 3/4 Test ROCm#3: CascadeTest ...................... Passed 3.24 sec Start 4: PageTest 4/4 Test ROCm#4: PageTest ......................... Passed 161.15 sec 100% tests passed, 0 tests failed out of 4

This PR fixes some of the unit test failures that occur in Single Decode. It also disables clang formatting of headers. The clang format of headers causes compilation issues. The compiler is unable to find `HIP WARP SYNC INTRINSICS` causing failures. Disabling clang format fixes these issues ``` Start 1: MathTest 1/6 Test ROCm#1: MathTest ......................... Passed 3.31 sec Start 2: PosEncTest 2/6 Test ROCm#2: PosEncTest ....................... Passed 3.36 sec Start 3: CascadeTest 3/6 Test ROCm#3: CascadeTest ...................... Passed 3.35 sec Start 4: PageTest 4/6 Test ROCm#4: PageTest ......................... Passed 114.08 sec Start 5: SingleDecodeTest 5/6 Test ROCm#5: SingleDecodeTest ................. Passed 35.22 sec Start 6: BatchDecodeTest 6/6 Test ROCm#6: BatchDecodeTest .................. Passed 559.75 sec 100% tests passed, 0 tests failed out of 6 Total Test time (real) = 719.07 sec ```

CPP test suite was using `hipified` headers. In this PR, we port over unit tests to use `gpu_iface`. This is necessary for us as the next step is to move the build infrastructure to use `gpu_iface` This PR has been tested locally ``` Test project /root/flashinfer/libflashinfer/tests/hip/build Start 1: MathTest 1/6 Test ROCm#1: MathTest ......................... Passed 3.40 sec Start 2: PosEncTest 2/6 Test ROCm#2: PosEncTest ....................... Passed 3.40 sec Start 3: CascadeTest 3/6 Test ROCm#3: CascadeTest ...................... Passed 985.27 sec Start 4: PageTest 4/6 Test ROCm#4: PageTest ......................... Passed 112.40 sec Start 5: SingleDecodeTest 5/6 Test ROCm#5: SingleDecodeTest ................. Passed 35.46 sec Start 6: BatchDecodeTest 6/6 Test ROCm#6: BatchDecodeTest .................. Passed 556.81 sec 100% tests passed, 0 tests failed out of 6 ``` To replicate the tests ``` cd flashinfer/libflashinfer/tests/hip ``` ``` mkdir build && cd build/ ``` ``` cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ .. ``` ``` make ``` ``` ctest ```

In this PR I remove the `libtorch` dependency and removed `test_page.cpp`. `test_page.cpp` is the only unit test that uses libtorch. However, we also have a pytest for testing page. We will use that for validation. Removing the libtorch dependency will help us speed docker builds and remove additional dependencies. ```Test project /root/flashinfer/libflashinfer/tests/hip/build Start 1: MathTest 1/8 Test ROCm#1: MathTest ............................ Passed 0.31 sec Start 2: PosEncTest 2/8 Test ROCm#2: PosEncTest .......................... Passed 0.31 sec Start 3: CascadeTest 3/8 Test ROCm#3: CascadeTest ......................... Passed 1369.12 sec Start 4: SingleDecodeTest 4/8 Test ROCm#4: SingleDecodeTest .................... Passed 7726.35 sec Start 5: BatchDecodeTest 5/8 Test ROCm#5: BatchDecodeTest ..................... Passed 811.61 sec Start 6: test_mfma_fp32_16x16x16fp16 6/8 Test ROCm#6: test_mfma_fp32_16x16x16fp16 ......... Passed 0.30 sec Start 7: test_transpose_4x4_half_registers 7/8 Test ROCm#7: test_transpose_4x4_half_registers ... Passed 0.28 sec Start 8: test_rowsum 8/8 Test ROCm#8: test_rowsum ......................... Passed 0.27 sec 100% tests passed, 0 tests failed out of 8 ```

ROCm#2) * Fix load_fragment * Fix Array OOO access in debug function * Fixes based on Copilot review. * More Copilot review based fixes

This PR introduces a patch to includes Tested with Unit Tests: Test project /root/amd_eng/flashinfer/libflashinfer/tests/hip/build Start 1: MathTest 1/4 Test #1: MathTest ......................... Passed 3.25 sec Start 2: PosEncTest 2/4 Test #2: PosEncTest ....................... Passed 3.25 sec Start 3: CascadeTest 3/4 Test #3: CascadeTest ...................... Passed 3.24 sec Start 4: PageTest 4/4 Test #4: PageTest ......................... Passed 161.15 sec 100% tests passed, 0 tests failed out of 4

This PR fixes some of the unit test failures that occur in Single Decode. It also disables clang formatting of headers. The clang format of headers causes compilation issues. The compiler is unable to find `HIP WARP SYNC INTRINSICS` causing failures. Disabling clang format fixes these issues ``` Start 1: MathTest 1/6 Test #1: MathTest ......................... Passed 3.31 sec Start 2: PosEncTest 2/6 Test #2: PosEncTest ....................... Passed 3.36 sec Start 3: CascadeTest 3/6 Test #3: CascadeTest ...................... Passed 3.35 sec Start 4: PageTest 4/6 Test #4: PageTest ......................... Passed 114.08 sec Start 5: SingleDecodeTest 5/6 Test #5: SingleDecodeTest ................. Passed 35.22 sec Start 6: BatchDecodeTest 6/6 Test #6: BatchDecodeTest .................. Passed 559.75 sec 100% tests passed, 0 tests failed out of 6 Total Test time (real) = 719.07 sec ```

CPP test suite was using `hipified` headers. In this PR, we port over unit tests to use `gpu_iface`. This is necessary for us as the next step is to move the build infrastructure to use `gpu_iface` This PR has been tested locally ``` Test project /root/flashinfer/libflashinfer/tests/hip/build Start 1: MathTest 1/6 Test #1: MathTest ......................... Passed 3.40 sec Start 2: PosEncTest 2/6 Test #2: PosEncTest ....................... Passed 3.40 sec Start 3: CascadeTest 3/6 Test #3: CascadeTest ...................... Passed 985.27 sec Start 4: PageTest 4/6 Test #4: PageTest ......................... Passed 112.40 sec Start 5: SingleDecodeTest 5/6 Test #5: SingleDecodeTest ................. Passed 35.46 sec Start 6: BatchDecodeTest 6/6 Test #6: BatchDecodeTest .................. Passed 556.81 sec 100% tests passed, 0 tests failed out of 6 ``` To replicate the tests ``` cd flashinfer/libflashinfer/tests/hip ``` ``` mkdir build && cd build/ ``` ``` cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ .. ``` ``` make ``` ``` ctest ```

In this PR I remove the `libtorch` dependency and removed `test_page.cpp`. `test_page.cpp` is the only unit test that uses libtorch. However, we also have a pytest for testing page. We will use that for validation. Removing the libtorch dependency will help us speed docker builds and remove additional dependencies. ```Test project /root/flashinfer/libflashinfer/tests/hip/build Start 1: MathTest 1/8 Test #1: MathTest ............................ Passed 0.31 sec Start 2: PosEncTest 2/8 Test #2: PosEncTest .......................... Passed 0.31 sec Start 3: CascadeTest 3/8 Test #3: CascadeTest ......................... Passed 1369.12 sec Start 4: SingleDecodeTest 4/8 Test #4: SingleDecodeTest .................... Passed 7726.35 sec Start 5: BatchDecodeTest 5/8 Test #5: BatchDecodeTest ..................... Passed 811.61 sec Start 6: test_mfma_fp32_16x16x16fp16 6/8 Test #6: test_mfma_fp32_16x16x16fp16 ......... Passed 0.30 sec Start 7: test_transpose_4x4_half_registers 7/8 Test #7: test_transpose_4x4_half_registers ... Passed 0.28 sec Start 8: test_rowsum 8/8 Test #8: test_rowsum ......................... Passed 0.27 sec 100% tests passed, 0 tests failed out of 8 ```

#2) * Fix load_fragment * Fix Array OOO access in debug function * Fixes based on Copilot review. * More Copilot review based fixes

diptorupd added 4 commits September 28, 2025 10:15

Fix load_fragment

d79105a

Fix Array OOO access in debug function

06cde05

Fixes based on Copilot review.

243adce

More Copilot review based fixes

51bda7b

diptorupd requested review from Copilot, demandal25 and rtmadduri September 29, 2025 18:35

Copilot AI reviewed Sep 29, 2025

View reviewed changes

rtmadduri approved these changes Oct 1, 2025

View reviewed changes

demandal25 approved these changes Oct 1, 2025

View reviewed changes

diptorupd merged commit bec5bdf into amd-integration Oct 1, 2025
1 check failed

diptorupd deleted the fix/load_fragment branch October 2, 2025 16:45

diptorupd added a commit that referenced this pull request Dec 5, 2025

Fixes fragment loading to properly pack 16b values into a 32b register (

04cfb5e

#2) * Fix load_fragment * Fix Array OOO access in debug function * Fixes based on Copilot review. * More Copilot review based fixes

diptorupd added a commit that referenced this pull request Feb 2, 2026

Fixes fragment loading to properly pack 16b values into a 32b register (

91dcd0a

#2) * Fix load_fragment * Fix Array OOO access in debug function * Fixes based on Copilot review. * More Copilot review based fixes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes fragment loading to properly pack 16b values into a 32b register#2

Fixes fragment loading to properly pack 16b values into a 32b register#2
diptorupd merged 4 commits intoamd-integrationfrom
fix/load_fragment

diptorupd commented Sep 29, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Sep 29, 2025

Uh oh!

Copilot AI Sep 29, 2025

Uh oh!

rtmadduri left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

	static_assert(std::is_same_v<T, __half>, "Only half type is supported");
	static_assert(std::is_same<T, __half>::value, "Only half type is supported");

Conversation

diptorupd commented Sep 29, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Sep 29, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 29, 2025

Choose a reason for hiding this comment

Uh oh!

rtmadduri left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants